6. Sentiment Topic Modeling: BERT (Bidirectional Encoder Representations from Transformers)¶
In [1]:
#pip install bertopic
In [2]:
import os
import time
import math
import re
import sys
import requests
import multiprocessing
from pandarallel import pandarallel
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from bertopic import BERTopic
from wordcloud import WordCloud
import nltk
import ast
# Hide all GPUs so that embedding and clustering run on the CPU
os.environ["CUDA_VISIBLE_DEVICES"] = ""
# Disable tokenizer parallelism to avoid fork warnings alongside multiprocessing
os.environ["TOKENIZERS_PARALLELISM"] = "false"
import warnings
# Suppress warnings to keep the cell output readable
warnings.simplefilter('ignore')
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
2023-12-02 07:31:29.486107: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered 2023-12-02 07:31:29.486264: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered 2023-12-02 07:31:29.698302: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered 2023-12-02 07:31:30.091849: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
In [3]:
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 500)
In [4]:
num_processors = multiprocessing.cpu_count()
# Leave one core free for the notebook kernel, but never drop below one worker
workers = max(1, num_processors - 1)
print(f'Using {workers} workers')
Using 15 workers
In [5]:
pandarallel.initialize(nb_workers=workers, use_memory_fs=False, progress_bar=True)
INFO: Pandarallel will run on 15 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
1. Import Data¶
In [6]:
%%time
file_path = 'news_vader_sent.parquet'
news = pd.read_parquet(file_path)
CPU times: user 20.1 s, sys: 18.2 s, total: 38.4 s Wall time: 32.5 s
In [7]:
news.shape # (198064, 18)
Out[7]:
(198064, 18)
In [8]:
news.columns
Out[8]:
Index(['url', 'date', 'language', 'title', 'text', 'year', 'month', 'day',
'text_ner', 'text_cleaned', 'text_lemm', 'title_ner', 'title_cleaned',
'title_lemm', 'title_word_count', 'text_word_count', 'vader_sent',
'vader_comp'],
dtype='object')
In [9]:
news.sample(1, random_state = 42)[['text_ner', 'text_cleaned', 'text_lemm', 'title_ner', 'title_cleaned', 'title_lemm']]
Out[9]:
| | text_ner | text_cleaned | text_lemm | title_ner | title_cleaned | title_lemm |
|---|---|---|---|---|---|---|
| 196666 | Prosecutors in all states urge Congress to strengthen tools to fight AI child sexual abuse images Skip to contentCommunity Coverage TourHome ProMedically SpeakingBest of the WestChampions in AgBack to Our AppsCOVID 19Food for NewsTexasNew to a TipLatest CamsClosings and DelaysSend Us Your Weather PhotosTxDOT Highway ConditionsDownload the Weather AppWeather ResourcesKCBD InvestigatesSubmit a TipChad Read ShootingReagor Dykes CoverageSex Trafficking on the South PlainsLubbock County Medical E... | prosecutors states urge congress strengthen tools fight ai child sexual abuse images skip contentcommunity coverage tourhome promedically speakingbest westchampions agback appscovid newstexasnew tiplatest camsclosings delayssend us weather photostxdot highway conditionsdownload weather appweather resourceskcbd investigatessubmit tipchad read shootingreagor dykes coveragesex trafficking south plainslubbock county medical examiner school beat petestats predictionshow watchcommunitytell somethi... | prosecutor state urge congress strengthen tool fight ai child sexual abuse image skip contentcommunity coverage tourhome promedically speakingbest westchampions agback appscovid newstexasnew tiplatest camsclosings delayssend u weather photostxdot highway conditionsdownload weather appweather resourceskcbd investigatessubmit tipchad read shootingreagor dyke coveragesex traffic south plainslubbock county medical examiner school beat petestats predictionshow watchcommunitytell something goodnot... | Prosecutors in all states urge Congress to strengthen tools to fight AI child sexual abuse images | prosecutors states urge congress strengthen tools fight ai child sexual abuse images | prosecutor state urge congress strengthen tool fight ai child sexual abuse image |
2. Sentiment Topic Modeling: BERT¶
Topic modeling can be done with LDA (for example via gensim or ktrain) or with BERTopic. The three options compare as follows.
BERTopic¶
- Nature: BERTopic leverages transformer-based models, like BERT, for generating document embeddings, which capture the contextual relationships between words in a text.
- Methodology: It uses dimensionality reduction (usually UMAP) and clustering algorithms (like HDBSCAN) on top of the embeddings to find topics.
- Advantages: BERTopic excels in capturing the semantic meaning of texts, offering more nuanced and contextually relevant topics.
- Use Cases: It is well-suited for advanced topic modeling tasks where deep contextual understanding is crucial.
- Computational Requirements: Similar to BERT, BERTopic is computationally intensive and generally requires more resources.
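The embedding, dimensionality reduction, and clustering steps listed above can be made explicit by passing each component to BERTopic. The following is a minimal sketch under stated assumptions, not the configuration used in this notebook (which relies on BERTopic's defaults apart from `min_topic_size=50`). The function name `build_topic_model`, the embedding model name, and the UMAP and HDBSCAN parameter values are illustrative choices.

```python
def build_topic_model(min_topic_size=50):
    """Assemble a BERTopic model with explicit embedding, UMAP and HDBSCAN steps."""
    # Imported lazily because these packages are heavy to load.
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer
    from umap import UMAP

    # Step 1: sentence embeddings (the model name is an assumption).
    embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
    # Step 2: reduce the embeddings to a few dimensions before clustering.
    umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                      metric="cosine", random_state=42)
    # Step 3: density-based clustering; documents in no cluster get topic -1.
    hdbscan_model = HDBSCAN(min_cluster_size=min_topic_size, metric="euclidean",
                            cluster_selection_method="eom", prediction_data=True)
    return BERTopic(embedding_model=embedding_model, umap_model=umap_model,
                    hdbscan_model=hdbscan_model, verbose=True)
```

A model built this way is fitted like any other, for example `topics, probs = build_topic_model().fit_transform(docs)` for a list of strings `docs`. Fixing `random_state` in UMAP makes repeated runs more comparable, at some cost in speed.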
LDA in Gensim¶
- Nature: This is a traditional topic modeling approach that assumes each document is a mixture of topics and each topic is a mixture of words.
- Methodology: It uses statistical methods to infer the latent topics in a corpus.
- Advantages: LDA in Gensim is well-established, easy to implement, and less resource-intensive compared to neural network approaches.
- Use Cases: Suitable for basic topic modeling needs where the primary goal is to identify broad topics within a large volume of text.
- Computational Requirements: Can be run efficiently on standard CPU setups.
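The "mixture of topics, mixture of words" assumption can be illustrated by sampling a document from the LDA generative process. This is a toy sketch with NumPy, not Gensim's inference code; the vocabulary, the number of topics, and the Dirichlet parameters are made-up values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["market", "growth", "patient", "hospital", "chatbot", "openai"]
n_topics = 3

# Each topic is a probability distribution over the vocabulary.
topic_word = rng.dirichlet(alpha=np.full(len(vocab), 0.5), size=n_topics)

def generate_document(n_words, doc_alpha=0.5):
    """Sample one document following the LDA generative story."""
    # Each document is a mixture of topics.
    doc_topic = rng.dirichlet(np.full(n_topics, doc_alpha))
    words = []
    for _ in range(n_words):
        z = rng.choice(n_topics, p=doc_topic)        # pick a topic for this word
        w = rng.choice(len(vocab), p=topic_word[z])  # pick a word from that topic
        words.append(vocab[w])
    return doc_topic, words

doc_topic, words = generate_document(n_words=8)
print(doc_topic.round(2), words)
```

Fitting LDA runs this story in reverse: given only the observed words, it estimates the topic-word and document-topic distributions that most plausibly produced them.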
LDA in ktrain¶
- Nature: ktrain, a wrapper for TensorFlow Keras, simplifies machine learning workflows. Its LDA implementation is similar to Gensim's but integrated within the ktrain ecosystem.
- Methodology: Utilizes statistical methods for topic modeling, akin to Gensim's LDA.
- Advantages: It provides a more user-friendly interface and integrates well with other ktrain functionalities for end-to-end machine learning tasks.
- Use Cases: Ideal for users who prefer a streamlined process for topic modeling along with other machine learning tasks, especially in a Keras/TensorFlow environment.
- Computational Requirements: Comparable to Gensim's LDA in terms of resource needs.
Summary¶
- BERTopic: Best for deep contextual understanding and advanced topic modeling, but resource-intensive.
- LDA in Gensim: A standard, widely used method for topic modeling, balancing performance and computational efficiency.
- LDA in ktrain: Offers a more accessible and integrated approach within the ktrain framework, suitable for those working within a Keras/TensorFlow environment.
In [10]:
%%time
news['text_tokens'] = news['text_lemm'].parallel_apply(nltk.word_tokenize)
CPU times: user 28.2 s, sys: 18.2 s, total: 46.3 s Wall time: 1min 58s
2.1. BERTopic on Positive Topics¶
In [11]:
# Copy the filtered rows so that columns added later do not trigger SettingWithCopyWarning
news_po = news[news['vader_sent'] == 'positive'].copy()
In [12]:
news_po.info()
<class 'pandas.core.frame.DataFrame'>
Index: 187561 entries, 0 to 198063
Data columns (total 19 columns):
 #   Column            Non-Null Count   Dtype
---  ------            --------------   -----
 0   url               187561 non-null  object
 1   date              187561 non-null  datetime64[ns]
 2   language          187561 non-null  object
 3   title             187561 non-null  object
 4   text              187561 non-null  object
 5   year              187561 non-null  int32
 6   month             187561 non-null  int32
 7   day               187561 non-null  int32
 8   text_ner          187561 non-null  object
 9   text_cleaned      187561 non-null  object
 10  text_lemm         187561 non-null  object
 11  title_ner         187561 non-null  object
 12  title_cleaned     187561 non-null  object
 13  title_lemm        187561 non-null  object
 14  title_word_count  187561 non-null  int64
 15  text_word_count   187561 non-null  int64
 16  vader_sent        187561 non-null  object
 17  vader_comp        187561 non-null  float64
 18  text_tokens       187561 non-null  object
dtypes: datetime64[ns](1), float64(1), int32(3), int64(2), object(12)
memory usage: 26.5+ MB
In [ ]:
%%time
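# calculate_probabilities=True computes a full document-by-topic probability matrix,
# which adds considerable clustering time on a corpus of this size
mod_BERT_pos = BERTopic(calculate_probabilities=True, verbose=True, min_topic_size=50)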
topics_pos, probabilities_pos = mod_BERT_pos.fit_transform(news_po['text_lemm'].tolist())
2023-12-02 07:34:16,117 - BERTopic - Embedding - Transforming documents to embeddings.
2023-12-02 09:28:49,785 - BERTopic - Embedding - Completed ✓
2023-12-02 09:28:49,788 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2023-12-02 09:33:27,956 - BERTopic - Dimensionality - Completed ✓
2023-12-02 09:33:27,962 - BERTopic - Cluster - Start clustering the reduced embeddings
2023-12-02 13:12:29,687 - BERTopic - Cluster - Completed ✓
2023-12-02 13:12:29,768 - BERTopic - Representation - Extracting topics from clusters using representation models.
2023-12-02 13:17:07,403 - BERTopic - Representation - Completed ✓
CPU times: user 15h 4min 12s, sys: 2h 16min 42s, total: 17h 20min 55s Wall time: 5h 44min 25s
In [ ]:
mod_BERT_pos.get_topic_info().head(20)
Out[ ]:
| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 66901 | -1_gray_ai_group_medium | [gray, ai, group, medium, use, data, technology, prnewswire, say, new] | [usd billion artificial intelligence ai social medium market expect reach عربي log remember forgot username password new create account home news news industry news region american europe arab world asia africa article press release report article submit article press release report r market data equity market global index mena index qutoes chart end day stock currency currency convertor cross rate historical currency libor mena stock commodoties oil energy economic calender research premium... |
| 1 | 0 | 3109 | 0_entrepreneur_employee_automation_data | [entrepreneur, employee, automation, data, generative, worker, job, business, ai, enterprise] | [ai adoption banking sector trend next year cio news et cio sign sign news news internet thing security next gen technology cloud compute business analytics strategy management big data mobility service apps consumer tech data center case study corporate social medium policy internet industry industry healthcare automotive manufacturing financial service retail ites banking case study case study digital transformation analytics ai rpa iot customer experience datacenter cloud datacenter ai da... |
| 2 | 1 | 3085 | 1_npr_radio_donate_schedule | [npr, radio, donate, schedule, listen, classical, donation, air, wunc, membership] | [meta lean wisdom crowd ai model release search query show search connect u contact u contest rule local host staff prt newsletter public file u contact u contest rule local host staff prt newsletter public file event community calendar public announcement submit event give take community calendar public announcement submit event give take listen listen live podcast directory listen live podcast directory news local regional oklahoma engage stateimpact oklahoma weather traffic npr national n... |
| 3 | 2 | 2459 | 2_ment_cision_entertain_overviewview | [ment, cision, entertain, overviewview, overview, general, resource, gdpr, consumer, transportation] | [launch denver delivery center hire people continue hyper growth data science solution business resource blog journalist log sign data privacy send release news product overview distribution pr newswire cision communication cloud cision ir product contact general inquiry request demo editorial bureau partnership medium inquiry worldwide office search search type field list search result appear automatically update type search content result found please change search term try news focus brow... |
| 4 | 3 | 2058 | 3_patient_healthcare_care_health | [patient, healthcare, care, health, medical, clinical, hospital, doctor, physician, medicine] | [healthcare ai advance rapidly american notice progress venturebeat skip main content event market u venturebeat homepage subscribe artificial intelligence view ai ml deep learn auto ml data label synthetic data conversational ai nlp text speech security view data security privacy network security privacy software security computer hardware security cloud data storage security data infrastructure view data science data management data storage cloud big data analytics data network automation ... |
| 5 | 4 | 2021 | 4_chatgpt_gpt_openai_chatbot | [chatgpt, gpt, openai, chatbot, text, user, write, tab, answer, chatbots] | [us chatgpt incredible way use ai power chatbot chatgpt trendingopenaixq valkyriedodo bird extinctgreen chatgptchatgpt incredible way use ai power chatbotyou heard chatgpt know use use idea get mcfadden feb pm estcreated feb pm image us engineering ipopba istock chatgpt understand generate human like chatbot use handy little internet search monthly paid professional experimental package available advanced language model developed openai cut edge natural language processing capability revolut... |
| 6 | 5 | 1765 | 5_market_analysis_player_growth | [market, analysis, player, growth, forecast, global, key, corporation, size, report] | [global ai social medium market comprehensive study impact analysis business strategy growth share size current well future challenge haitian caribbean news network skip content thursday dec break news global shower gel market analysis share circumstance covid outbreak forecast performance costume market witness huge growth beedpan pierre cardin zara worldwide image recognition market data management supply chain analytics forecasting global search advertising software market late impact ana... |
| 7 | 6 | 1272 | 6_student_teacher_classroom_education | [student, teacher, classroom, education, teach, educator, school, essay, chatgpt, cheat] | [paper exam chatbot ban college seek chatgpt proof assignment skip careslive watchweather day camslive newscastssubmit story ideasubmit photo videosfind call sport blitzfriday night endzonebroncosair forcestats predictionshow watchadvertise usstation jobskktv news carescommunity calendargood news fridayvideomeet teamcontact usmr foodlatest dailytv listingscircle country music lifestylegray dc healthpress releasespaper exam chatbot ban college seek chatgpt proof assignmentslori anne salem ass... |
| 8 | 7 | 1256 | 7_human_bias_could_humanity | [human, bias, could, humanity, ai, extinction, think, intelligence, might, risk] | [golden opportunity shape ai national priority answer vital question future ai humanity get say ai ethic ai lawsubscribe sign inbetathis beta experience may opt click heremore forbesjun edtgenerative ai cybersecurity friend foejun edtyes duty judge court forewarn lawyer potential pitfall use generative ai legal work asks ai ethic ai lawjun edtfostering international collaboration ai healthcarejun edttravel summer fun google chatgpt ai companionsjun edtis cod education know dead may edtthe ri... |
| 9 | 8 | 1078 | 8_newswires_presswire_ein_guinea | [newswires, presswire, ein, guinea, dakota, virginia, carolina, distribution, south, north] | [ai new ai crypto announces launch ein presswire different well work testimonial contact ein presswire news pricing distribution distribution overview medium database major news site tv radio station international newswires newswires industry newswires country newswires state mobile apps newsplugin live feed sample distribution report press release feature industry country state archive newswires international newswires newswires industry agriculture airline automotive banking book publishin... |
| 10 | 9 | 1048 | 9_ago_hour_bestreviews_nexstar | [ago, hour, bestreviews, nexstar, story, weather, newsfeed, file, video, ap] | [chatgpt maker openai sign deal ap license news story wsyr skip content wsyr syracuse sign syracuse sponsor toggle menu open navigation close navigation search please enter search term primary menu local news contact newschannel watch story micron come clay northern ny news newsmakers andrew donovan money pocket rick reagan state news national news politics hill local election headquarters washington dc ny capitol news russia ukraine conflict entertainment automotive news newsletter regional... |
| 11 | 10 | 1030 | 10_bard_google_chatbot_chatgpt | [bard, google, chatbot, chatgpt, pichai, response, answer, pixel, tab, bing] | [google bard v chatgpt win ai battle beebom skip content beebom search news review game minecraft discord best mobile iphone snapchat android iphone internet metaverse alternative pc linux mac window u contact u beebom career advertise privacy policy disclaimer apple facebook feature google iphone microsoft samsung whatsapp window xiaomi ai home ai google bard v chatgpt win ai battle google bard v chatgpt win ai battle upanishad sharma last update march pm come ai landscape chatgpt reign sup... |
| 12 | 11 | 972 | 11_market_artificial_intelligence_analysis | [market, artificial, intelligence, analysis, size, forecast, growth, player, report, global] | [artificial intelligence service market witness huge growth international business machine sap google watch news contact u u watch news market report analytics news market report industry analytics industry report market research business opportunity emerge trend growth prospect homeglobal newsartificial intelligence service market witness huge growth international business machine sap google artificial intelligence service market witness huge growth international business machine sap google... |
| 13 | 12 | 913 | 12_regulation_cookie_government_law | [regulation, cookie, government, law, federal, white, agency, house, cooky, administration] | [next ai regulation uk privacy protection uk home uk privacy contributor article share forward article save file pocket linkedin twitter facebook follow question event upcoming event may unpack regulation trend investment fund market seminar hong kong hong kong event print translate translation uk next ai regulation uk may giulia trojano withers llp linkedin connection author print article need register login march uk government publish white paper detail pro innovation approach ai regulatio... |
| 14 | 13 | 910 | 13_nvidia_gpus_chip_huang | [nvidia, gpus, chip, huang, gpu, dgx, supercomputer, compute, rtx, jensen] | [nvidia ceo highlight chip historic wave generative ai computex venturebeat skip main content event video special issue subscribe venturebeat homepage game development view program o host platform metaverse view virtual environment technology vr headset gadget virtual reality game game hardware view chipsets processing unit headset controller game pc display console game business view game publishing game monetization merger acquisition game release special event game workplace late game rev... |
| 15 | 14 | 866 | 14_hunt_rank_connectstoriestech_hoursgive | [hunt, rank, connectstoriestech, hoursgive, teamoffice, discusscollect, insign, guidechecklists, sooncheck, teamvisit] | [enhance work ai product huntproductsbest productsdiscover best product monthtopicsbrowse product topicscoming sooncheck launch come soonbuilding progresssee maker currently curated communitytime travelmost love product best product hunt date late questionsanswer interest questionslaunch guidechecklists pro tip question find support connectstoriestech news interview tip note product hunt teamoffice hoursgive feedback directly product teamvisit streaksthe active community membershall famegold... |
| 16 | 15 | 826 | 15_covid_coronavirus_virus_outbreak | [covid, coronavirus, virus, outbreak, vaccine, disease, pandemic, researcher, spread, symptom] | [artificial intelligence covid machine save u washington post skip main contentsearch dy darknesssign inprofilesign inprofilenext articlescoronavirus pandemicasia pacificcovid death soar jakarta graveyard run sp politics whole lot hurt fauci warns covid surge offer blunt assessme europenew pandemic lockdown england restriction return across europeeuropewith coronavirus explode europe hospital calculate long th europehow coronavirus make nearly impossible renounce house sidestep fda distribut... |
| 17 | 16 | 790 | 16_nasdaq_symbol_watchlist_quote | [nasdaq, symbol, watchlist, quote, add, amzn, gme, aapl, tsla, amc] | [ai stock touch foot pole hint nvidia nasdaq skip main content nasdaq weekly macro market activity market activity stock option etf mutual fund index commodity cryptocurrency currency future fix income global market market regulation regulation european regulation quick link real time quote hour quote pre market quote nasdaq symbol screener online broker glossary sustainable bond network symbol change history ipo performance ownership search dividend history invest list fundinsight market ev... |
| 18 | 17 | 780 | 17_music_song_artist_musician | [music, song, artist, musician, musical, drake, lyric, sound, songwriter, songtradr] | [youtube announces ai music principle launch youtube music ai incubator artist songwriter producer universal music group skip photo videossturgis rallydigital artshealth medicallaw weather appclosingsweather camsweather bloggood morning kota territorycooking ericfood drinkmr foodsheridan cookswine minutesportspigskin previewbig ol fishfriday night hikeathlete weekstats predictionshow resultscommunity usmeet teamcareerssubmit storysubmit photo videosdigital schedulecovid local businesscircle ... |
| 19 | 18 | 776 | 18_wfmz_lehigh_berk_valley | [wfmz, lehigh, berk, valley, traffic, tv, allentown, wdpn, freddy, corridor] | [north summit capital ceo actionable insight another definition ai technology news permission edit article edit close sign log dashboard logout account dashboard profile save item logout home news coronavirus info lehigh valley berk regional school u world sunrise inside town espanol case miss recall miss person good news weather forecast hour hour local radar weather channel stream river level pocono camera school business closing send weather report traffic live stream camera camera alert ... |
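The `Representation` column lists the top words per topic. By default BERTopic ranks words with a class-based TF-IDF (c-TF-IDF), which treats all documents in a cluster as one long document and weights each term by `tf * log(1 + A / f)`, where `A` is the average number of words per cluster and `f` is the term's frequency across all clusters. The sketch below illustrates that weighting on toy data; the clusters are invented, and raw counts are used for `tf` rather than BERTopic's normalized frequencies, which does not change the ranking within a cluster.

```python
import math
from collections import Counter

# Toy clusters: each value is the list of tokenized documents assigned to a topic.
clusters = {
    0: [["patient", "hospital", "ai"], ["doctor", "patient", "ai"]],
    1: [["market", "forecast", "ai"], ["market", "growth", "ai"]],
}

# Treat each cluster as one long document and count its terms.
class_counts = {
    c: Counter(tok for doc in docs for tok in doc) for c, docs in clusters.items()
}

# Term frequency across all clusters, and the average number of words per cluster.
total_counts = Counter()
for counts in class_counts.values():
    total_counts.update(counts)
avg_words = sum(total_counts.values()) / len(class_counts)

def ctfidf(term, cluster_id):
    """Weight of a term within one cluster."""
    tf = class_counts[cluster_id][term]
    return tf * math.log(1 + avg_words / total_counts[term])

def top_words(cluster_id, n=2):
    """Return the n highest-weighted terms of a cluster."""
    terms = class_counts[cluster_id]
    return sorted(terms, key=lambda t: ctfidf(t, cluster_id), reverse=True)[:n]

print(top_words(0), top_words(1))
```

The word "ai" occurs in every cluster, so its weight is pushed down even though it is frequent. The same mechanism explains why site boilerplate shared by only some outlets (for example `npr`, `cision`, `newswires`) surfaces as its own topic in the table above: those terms are frequent within one cluster and rare elsewhere.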
In [ ]:
# get_topic_info() already returns a DataFrame, so no extra wrapping is needed
positive_topic_df = mod_BERT_pos.get_topic_info()
In [ ]:
print(positive_topic_df.shape)
(699, 5)
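The 699 rows include one row for topic `-1`, the label HDBSCAN assigns to documents that fall in no cluster. A quick check of how large that outlier group is relative to the corpus can be sketched as follows; the small `topic_info` frame is a stand-in built from the counts printed above, and `n_docs` is taken from the `news_po.info()` output.

```python
import pandas as pd

# First rows of the topic table shown above; counts copied from the output.
topic_info = pd.DataFrame({
    "Topic": [-1, 0, 1, 2],
    "Count": [66901, 3109, 3085, 2459],
})
n_docs = 187561  # number of rows in news_po

# Share of documents that were not assigned to any topic.
outlier_count = int(topic_info.loc[topic_info["Topic"] == -1, "Count"].iloc[0])
outlier_share = outlier_count / n_docs
print(f"{outlier_count} of {n_docs} documents ({outlier_share:.1%}) are outliers")
```

With the real model, the same calculation would use `positive_topic_df` and `len(news_po)`. Roughly a third of the positive articles end up unassigned, which is worth keeping in mind when interpreting topic sizes; recent BERTopic versions provide a `reduce_outliers` method that may help, if the installed version includes it.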
In [ ]:
positive_topic_df.head()
Out[ ]:
| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 66901 | -1_gray_ai_group_medium | [gray, ai, group, medium, use, data, technology, prnewswire, say, new] | [usd billion artificial intelligence ai social medium market expect reach عربي log remember forgot username password new create account home news news industry news region american europe arab world asia africa article press release report article submit article press release report r market data equity market global index mena index qutoes chart end day stock currency currency convertor cross rate historical currency libor mena stock commodoties oil energy economic calender research premium... |
| 1 | 0 | 3109 | 0_entrepreneur_employee_automation_data | [entrepreneur, employee, automation, data, generative, worker, job, business, ai, enterprise] | [ai adoption banking sector trend next year cio news et cio sign sign news news internet thing security next gen technology cloud compute business analytics strategy management big data mobility service apps consumer tech data center case study corporate social medium policy internet industry industry healthcare automotive manufacturing financial service retail ites banking case study case study digital transformation analytics ai rpa iot customer experience datacenter cloud datacenter ai da... |
| 2 | 1 | 3085 | 1_npr_radio_donate_schedule | [npr, radio, donate, schedule, listen, classical, donation, air, wunc, membership] | [meta lean wisdom crowd ai model release search query show search connect u contact u contest rule local host staff prt newsletter public file u contact u contest rule local host staff prt newsletter public file event community calendar public announcement submit event give take community calendar public announcement submit event give take listen listen live podcast directory listen live podcast directory news local regional oklahoma engage stateimpact oklahoma weather traffic npr national n... |
| 3 | 2 | 2459 | 2_ment_cision_entertain_overviewview | [ment, cision, entertain, overviewview, overview, general, resource, gdpr, consumer, transportation] | [launch denver delivery center hire people continue hyper growth data science solution business resource blog journalist log sign data privacy send release news product overview distribution pr newswire cision communication cloud cision ir product contact general inquiry request demo editorial bureau partnership medium inquiry worldwide office search search type field list search result appear automatically update type search content result found please change search term try news focus brow... |
| 4 | 3 | 2058 | 3_patient_healthcare_care_health | [patient, healthcare, care, health, medical, clinical, hospital, doctor, physician, medicine] | [healthcare ai advance rapidly american notice progress venturebeat skip main content event market u venturebeat homepage subscribe artificial intelligence view ai ml deep learn auto ml data label synthetic data conversational ai nlp text speech security view data security privacy network security privacy software security computer hardware security cloud data storage security data infrastructure view data science data management data storage cloud big data analytics data network automation ... |
In [20]:
from google.cloud import storage
In [21]:
positive_topic_df.to_parquet('bert_po_topic_info.parquet')
In [22]:
# Google Cloud Storage details
bucket_name = 'nlp-final'
blob_name = 'bert_po_topic_info.parquet' # This is the name the file will have in GCS
local_file_path = 'bert_po_topic_info.parquet' # Path to the local file you just saved
# Create a GCS client (uses the default credentials of the environment)
storage_client = storage.Client()
# Get the bucket
bucket = storage_client.get_bucket(bucket_name)
# Create a blob object for the target name in the bucket
blob = bucket.blob(blob_name)
# Upload the file
blob.upload_from_filename(local_file_path)
In [23]:
news_po['bert_topics'] = mod_BERT_pos.topics_
# news_po['bert_topics_words'] = news_po['bert_topics'].apply(lambda x: mod_BERT_pos.get_topic(x))
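An alternative to calling `get_topic` once per row is to look up the top words from the topic table with a single `map`. The sketch below shows the idea on stand-in data; `topic_info` mimics a few rows of `mod_BERT_pos.get_topic_info()` with shortened word lists, and `docs` mimics the `bert_topics` column of `news_po`. Both frames are hypothetical.

```python
import pandas as pd

# Stand-in for the topic table (word lists shortened from the output above).
topic_info = pd.DataFrame({
    "Topic": [-1, 3, 12],
    "Representation": [
        ["gray", "ai", "group"],
        ["patient", "healthcare", "care"],
        ["regulation", "cookie", "government"],
    ],
})

# Stand-in for news_po with its assigned topic ids.
docs = pd.DataFrame({"bert_topics": [-1, 12, 3, 12]})

# Index the word lists by topic id, then map each document's topic to its words.
topic_words = topic_info.set_index("Topic")["Representation"]
docs["bert_topics_words"] = docs["bert_topics"].map(topic_words)
print(docs)
```

Because the lookup is built once, this avoids repeated model calls on a frame with more than 187,000 rows. Topic ids missing from the table would map to `NaN`, which makes gaps easy to spot.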
In [24]:
news_po.sample(3, random_state = 42)
Out[24]:
| | url | date | language | title | text | year | month | day | text_ner | text_cleaned | text_lemm | title_ner | title_cleaned | title_lemm | title_word_count | text_word_count | vader_sent | vader_comp | text_tokens | bert_topics |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 29974 | https://finance.yahoo.com/news/openai-tentacles-hundreds-companies-heres-173000745.html | 2023-05-03 | en | OpenAI has its tentacles in hundreds of companies. Here's how it's making them more productive. | OpenAI has its tentacles in hundreds of companies. Here's how it's making them more productive. HOME MAIL NEWS FINANCE SPORTS ENTERTAINMENT LIFE SEARCH SHOPPING YAHOO PLUS MORE... Yahoo Finance Yahoo Finance Sign in Mail Sign in to view your mail Finance Watchlists My Portfolio Crypto Yahoo Finance Plus Dashboard Research Reports Inv... | 2023 | 5 | 3 | OpenAI has its tentacles in hundreds of companies. Here s how it s making them more productive. HOME MAIL NEWS FINANCE SPORTS ENTERTAINMENT LIFE SEARCH SHOPPING YAHOO PLUS MORE ... Yahoo Finance Yahoo Finance Sign in Mail Sign in to view your mail Finance Watchlists My Portfolio Crypto Yahoo Finance Plus Dashboard Research Reports Investment Ideas Community Insights Webinars Blog News Latest News Yahoo Finance Originals Stock Market News Earnings Politics Economic News Morning Brief Personal... | openai tentacles hundreds companies making productive home mail news finance sports entertainment life search shopping yahoo plus yahoo finance yahoo finance sign mail sign view mail finance watchlists portfolio crypto yahoo finance plus dashboard research reports investment ideas community insights webinars blog news latest news yahoo finance originals stock market news earnings politics economic news morning brief personal finance crypto news bidenomics report card screeners saved screener... 
| openai tentacle hundred company make productive home mail news finance sport entertainment life search shopping yahoo plus yahoo finance yahoo finance sign mail sign view mail finance watchlists portfolio crypto yahoo finance plus dashboard research report investment idea community insight webinars blog news late news yahoo finance original stock market news earnings politics economic news morning brief personal finance crypto news bidenomics report card screener save screener equity screene... | OpenAI has its tentacles in hundreds of companies. Here s how it s making them more productive. | openai tentacles hundreds companies making productive | openai tentacle hundred company make productive | 6 | 1439 | positive | 0.9988 | [openai, tentacle, hundred, company, make, productive, home, mail, news, finance, sport, entertainment, life, search, shopping, yahoo, plus, yahoo, finance, yahoo, finance, sign, mail, sign, view, mail, finance, watchlists, portfolio, crypto, yahoo, finance, plus, dashboard, research, report, investment, idea, community, insight, webinars, blog, news, late, news, yahoo, finance, original, stock, market, news, earnings, politics, economic, news, morning, brief, personal, finance, crypto, news... | -1 |
| 124108 | https://www.wflx.com/prnewswire/2023/01/04/bright-direction-dental-selects-overjet-ai-elevate-patient-care/ | 2023-01-04 | en | Bright Direction Dental Selects Overjet AI to Elevate Patient Care | Bright Direction Dental Selects Overjet AI to Elevate Patient Care\n\nSkip to contentNewsWeatherHurricane GuideTrafficSportsCalendarSouth Florida WeekendWatch LiveWatch LiveHomeNewsNationalEntertainmentWeatherHurricane GuideSouth Florida WeekendSportsAbout UsContact UsNextGen TVProgramming ScheduleLatest NewscastsPowerNationCircle - Country Music & LifestyleGray DC BureauInvestigate TVPress ReleasesBright Direction Dental Selects Overjet AI to Elevate Patient CarePublished: Jan. 4, 2023 at 9... | 2023 | 1 | 4 | Bright Direction Dental Selects Overjet AI to Elevate Patient Care Skip to Florida WeekendWatch LiveWatch GuideSouth Florida WeekendSportsAbout UsContact UsNextGen TVProgramming ScheduleLatest Country Music LifestyleGray DC BureauInvestigate TVPress ReleasesBright Direction Dental Selects Overjet AI to Elevate Patient CarePublished Jan., at AM EST Updated hours agoThe DSO embraced technological innovation and partnered with Overjet for AI powered radiograph analysis, clinical insights, and o... | bright direction dental selects overjet ai elevate patient care skip florida weekendwatch livewatch guidesouth florida weekendsportsabout uscontact usnextgen tvprogramming schedulelatest country music lifestylegray dc bureauinvestigate tvpress releasesbright direction dental selects overjet ai elevate patient carepublished est updated hours agothe dso embraced technological innovation partnered overjet ai powered radiograph analysis clinical insights operational efficiency boston prnewswire ... 
| bright direction dental selects overjet ai elevate patient care skip florida weekendwatch livewatch guidesouth florida weekendsportsabout uscontact usnextgen tvprogramming schedulelatest country music lifestylegray dc bureauinvestigate tvpress releasesbright direction dental selects overjet ai elevate patient carepublished est update hour agothe dso embrace technological innovation partner overjet ai power radiograph analysis clinical insight operational efficiency boston prnewswire bright d... | Bright Direction Dental Selects Overjet AI to Elevate Patient Care | bright direction dental selects overjet ai elevate patient care | bright direction dental selects overjet ai elevate patient care | 9 | 418 | positive | 0.9989 | [bright, direction, dental, selects, overjet, ai, elevate, patient, care, skip, florida, weekendwatch, livewatch, guidesouth, florida, weekendsportsabout, uscontact, usnextgen, tvprogramming, schedulelatest, country, music, lifestylegray, dc, bureauinvestigate, tvpress, releasesbright, direction, dental, selects, overjet, ai, elevate, patient, carepublished, est, update, hour, agothe, dso, embrace, technological, innovation, partner, overjet, ai, power, radiograph, analysis, clinical, insigh... | 122 |
| 36914 | https://brandequity.economictimes.indiatimes.com/news/digital/regulators-dust-off-rule-books-to-tackle-generative-ai-like-chatgpt/100447567 | 2023-05-23 | en | Ai Regulation: Regulators dust off rule books to tackle generative AI like ChatGPT, ET BrandEquity | \n\n\nAi Regulation: Regulators dust off rule books to tackle generative AI like ChatGPT, ET BrandEquity\n\n \n\nX\n\n\nWe use cookies to ensure best experience for you\nWe use cookies and other tracking technologies to improve your browsing experience on our site, show personalize content and targeted ads, analyze site traffic, and understand where our audience is coming from. You can also read our privacy policy, We use cookies to ensure the best experience for you on our website.\nBy choo... | 2023 | 5 | 23 | Ai Regulation Regulators dust off rule books to tackle generative AI like ChatGPT, ET BrandEquity X We use cookies to ensure best experience for you We use cookies and other tracking technologies to improve your browsing experience on our site, show personalize content and targeted ads, analyze site traffic, and understand where our audience is coming from. You can also read our privacy policy, We use cookies to ensure the best experience for you on our website. By choosing I accept, or by c... | ai regulation regulators dust rule books tackle generative ai like chatgpt et brandequity use cookies ensure best experience use cookies tracking technologies improve browsing experience site show personalize content targeted ads analyze site traffic understand audience coming also read privacy policy use cookies ensure best experience website choosing accept continuing website consent use cookies terms conditions analytics performance cookies targeted advertising cookies login get app news ... 
| ai regulation regulator dust rule book tackle generative ai like chatgpt et brandequity use cooky ensure best experience use cooky track technology improve browsing experience site show personalize content target ad analyze site traffic understand audience come also read privacy policy use cooky ensure best experience website choose accept continue website consent use cooky term condition analytics performance cooky target advertising cooky login get app news marketingmediathe people pitch r... | Ai Regulation Regulators dust off rule books to tackle generative AI like ChatGPT, ET BrandEquity | ai regulation regulators dust rule books tackle generative ai like chatgpt et brandequity | ai regulation regulator dust rule book tackle generative ai like chatgpt et brandequity | 13 | 868 | positive | 0.9976 | [ai, regulation, regulator, dust, rule, book, tackle, generative, ai, like, chatgpt, et, brandequity, use, cooky, ensure, best, experience, use, cooky, track, technology, improve, browsing, experience, site, show, personalize, content, target, ad, analyze, site, traffic, understand, audience, come, also, read, privacy, policy, use, cooky, ensure, best, experience, website, choose, accept, continue, website, consent, use, cooky, term, condition, analytics, performance, cooky, target, advertis... | 12 |
Topic Visualization¶
In [32]:
fig = mod_BERT_pos.visualize_topics()
fig.write_html("bertopic_visualization.html") # For saving as interactive HTML
fig.show()
Topic Frequency¶
In [33]:
# Bar charts of the top keyword scores (c-TF-IDF) for the largest topics
fig = mod_BERT_pos.visualize_barchart()
fig.write_html("topic_frequency.html")
Topic Hierarchy¶
In [34]:
# Dendrogram of the hierarchical clustering of topics
fig = mod_BERT_pos.visualize_hierarchy()
fig.write_html("topic_hierarchy.html")
Topic Similarity¶
In [35]:
# Heatmap of the pairwise similarity between topics
fig = mod_BERT_pos.visualize_heatmap()
fig.write_html("topic_similarity.html")
Intertopic Distance Map¶
In [36]:
# Same call as under "Topic Visualization"; the figure is saved under a second file name
fig = mod_BERT_pos.visualize_topics()
fig.write_html("intertopic_distance_map.html")
In [26]:
# get_topic_freq() returns one row per topic, including the outlier topic -1
print("Number of topics:", mod_BERT_pos.get_topic_freq().shape[0])
Number of topics: 699
In [30]:
news_po.to_parquet('news_bert_po.parquet')
In [31]:
from google.cloud import storage  # provides storage.Client(); requires the google-cloud-storage package

# Google Cloud Storage details
bucket_name = 'nlp-final'
file_path = 'news_bert_po.parquet' # This is the name the file will have in GCS
local_file_path = 'news_bert_po.parquet' # Path to the local file you just saved
# Create a GCS Client
storage_client = storage.Client()
# Get the bucket
bucket = storage_client.get_bucket(bucket_name)
# Create a blob object from the filepath
blob = bucket.blob(file_path)
# Upload the file
blob.upload_from_filename(local_file_path)
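The upload steps above can be wrapped in a small helper that fails early when the local file is missing. This is a minimal sketch, not the notebook's original code: `upload_file` and its `client` parameter are hypothetical names, and when no client is passed it assumes the `google-cloud-storage` package and default credentials are available.

```python
from pathlib import Path


def upload_file(local_path, bucket_name, blob_name, client=None):
    """Upload a local file to a GCS bucket and return its gs:// URI.

    `client` is a hypothetical injection point: any object exposing
    bucket(name).blob(name).upload_from_filename(path) will work.
    """
    path = Path(local_path)
    if not path.is_file():
        # Fail before touching the network if the file was never written
        raise FileNotFoundError(f"Local file not found: {path}")

    if client is None:
        # Imported lazily so the helper can be exercised without the package
        from google.cloud import storage
        client = storage.Client()

    bucket = client.bucket(bucket_name)   # bucket handle, no existence check
    blob = bucket.blob(blob_name)         # target object name inside the bucket
    blob.upload_from_filename(str(path))  # performs the actual upload
    return f"gs://{bucket_name}/{blob_name}"
```

With the values used in this notebook, the call would be `upload_file('news_bert_po.parquet', 'nlp-final', 'news_bert_po.parquet')`.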
3. Positive Sentiment Analysis Over Time¶
3.1. Understanding the Main Topics¶
1. Topic Distribution¶
In [38]:
news_po[['text_ner', 'bert_topics']].sample(3, random_state = 42)
Out[38]:
| | text_ner | bert_topics |
|---|---|---|
| 29974 | OpenAI has its tentacles in hundreds of companies. Here s how it s making them more productive. HOME MAIL NEWS FINANCE SPORTS ENTERTAINMENT LIFE SEARCH SHOPPING YAHOO PLUS MORE ... Yahoo Finance Yahoo Finance Sign in Mail Sign in to view your mail Finance Watchlists My Portfolio Crypto Yahoo Finance Plus Dashboard Research Reports Investment Ideas Community Insights Webinars Blog News Latest News Yahoo Finance Originals Stock Market News Earnings Politics Economic News Morning Brief Personal... | -1 |
| 124108 | Bright Direction Dental Selects Overjet AI to Elevate Patient Care Skip to Florida WeekendWatch LiveWatch GuideSouth Florida WeekendSportsAbout UsContact UsNextGen TVProgramming ScheduleLatest Country Music LifestyleGray DC BureauInvestigate TVPress ReleasesBright Direction Dental Selects Overjet AI to Elevate Patient CarePublished Jan., at AM EST Updated hours agoThe DSO embraced technological innovation and partnered with Overjet for AI powered radiograph analysis, clinical insights, and o... | 122 |
| 36914 | Ai Regulation Regulators dust off rule books to tackle generative AI like ChatGPT, ET BrandEquity X We use cookies to ensure best experience for you We use cookies and other tracking technologies to improve your browsing experience on our site, show personalize content and targeted ads, analyze site traffic, and understand where our audience is coming from. You can also read our privacy policy, We use cookies to ensure the best experience for you on our website. By choosing I accept, or by c... | 12 |
In [39]:
news_po['bert_topics'].value_counts(ascending = False).reset_index(name = 'count')
Out[39]:
| | bert_topics | count |
|---|---|---|
| 0 | -1 | 66901 |
| 1 | 0 | 3109 |
| 2 | 1 | 3085 |
| 3 | 2 | 2459 |
| 4 | 3 | 2058 |
| ... | ... | ... |
| 694 | 693 | 51 |
| 695 | 694 | 50 |
| 696 | 695 | 50 |
| 697 | 696 | 50 |
| 698 | 697 | 50 |
699 rows × 2 columns
In [40]:
news_po['bert_topics'].value_counts(ascending = False, normalize = True).reset_index(name = 'portion')
Out[40]:
| | bert_topics | portion |
|---|---|---|
| 0 | -1 | 0.356689 |
| 1 | 0 | 0.016576 |
| 2 | 1 | 0.016448 |
| 3 | 2 | 0.013110 |
| 4 | 3 | 0.010972 |
| ... | ... | ... |
| 694 | 693 | 0.000272 |
| 695 | 694 | 0.000267 |
| 696 | 695 | 0.000267 |
| 697 | 696 | 0.000267 |
| 698 | 697 | 0.000267 |
699 rows × 2 columns
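The counts and portions above can also be produced in one table, together with the outlier share and the number of topics excluding -1. The following is a minimal sketch on toy data; `summarize_topics` and the toy series are illustrative names, not part of the notebook.

```python
import pandas as pd


def summarize_topics(topics, outlier_id=-1):
    """Return (summary table, outlier share, number of non-outlier topics)."""
    counts = topics.value_counts()  # sorted from most to least frequent
    summary = pd.DataFrame({"count": counts, "portion": counts / counts.sum()})
    summary.index.name = "bert_topics"

    # Share of documents that landed in the outlier topic (0.0 if there is none)
    if outlier_id in summary.index:
        outlier_share = float(summary.loc[outlier_id, "portion"])
    else:
        outlier_share = 0.0

    # Number of topics once the outlier topic is left out
    n_real_topics = int((summary.index != outlier_id).sum())
    return summary, outlier_share, n_real_topics


# Toy stand-in for news_po['bert_topics']
toy_topics = pd.Series([-1, -1, 0, 0, 0, 1], name="bert_topics")
summary, outlier_share, n_real_topics = summarize_topics(toy_topics)
print(summary)
print(f"outlier share: {outlier_share:.3f}, topics without -1: {n_real_topics}")
```

Applied to `news_po['bert_topics']`, the same function would report the -1 share shown in the table above and one topic fewer than the 699 rows.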
2. Topic related information: Interpretation¶
- Topic: The integer identifier of the topic. Topic -1 is BERTopic's outlier topic: it collects the documents that the clustering step did not assign to any topic, so it should be read separately from the numbered topics.
- Count: The number of documents assigned to the topic. Topics with a high count are more prevalent in the dataset; here topic -1 alone holds 66,901 documents, about 36% of news_po.
- Name: The topic identifier followed by the topic's first four keywords, joined with underscores (for example, 3_patient_healthcare_care_health). The name gives a quick idea of what the topic is about.
- Representation: The keywords that characterize the topic, listed from most to least representative (by default BERTopic ranks them by c-TF-IDF score).
- Representative_Docs: Documents (or parts of them) that are most representative of the topic. They show the context in which the topic keywords appear.
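The relationship between the Name and Representation columns can be checked against the table that follows. This is a small sketch of the pattern visible in that output; `topic_name` is a hypothetical helper, not a BERTopic function.

```python
def topic_name(topic_id, representation, n_words=4):
    """Rebuild a topic name as '<id>_<word1>_..._<wordN>' from its keywords."""
    return "_".join([str(topic_id)] + list(representation[:n_words]))


# Keywords of topic 3 as listed in positive_topic_df
representation = ["patient", "healthcare", "care", "health", "medical",
                  "clinical", "hospital", "doctor", "physician", "medicine"]

print(topic_name(3, representation))  # 3_patient_healthcare_care_health
```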
In [41]:
positive_topic_df.head(10)
Out[41]:
| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 66901 | -1_gray_ai_group_medium | [gray, ai, group, medium, use, data, technology, prnewswire, say, new] | [usd billion artificial intelligence ai social medium market expect reach عربي log remember forgot username password new create account home news news industry news region american europe arab world asia africa article press release report article submit article press release report r market data equity market global index mena index qutoes chart end day stock currency currency convertor cross rate historical currency libor mena stock commodoties oil energy economic calender research premium... |
| 1 | 0 | 3109 | 0_entrepreneur_employee_automation_data | [entrepreneur, employee, automation, data, generative, worker, job, business, ai, enterprise] | [ai adoption banking sector trend next year cio news et cio sign sign news news internet thing security next gen technology cloud compute business analytics strategy management big data mobility service apps consumer tech data center case study corporate social medium policy internet industry industry healthcare automotive manufacturing financial service retail ites banking case study case study digital transformation analytics ai rpa iot customer experience datacenter cloud datacenter ai da... |
| 2 | 1 | 3085 | 1_npr_radio_donate_schedule | [npr, radio, donate, schedule, listen, classical, donation, air, wunc, membership] | [meta lean wisdom crowd ai model release search query show search connect u contact u contest rule local host staff prt newsletter public file u contact u contest rule local host staff prt newsletter public file event community calendar public announcement submit event give take community calendar public announcement submit event give take listen listen live podcast directory listen live podcast directory news local regional oklahoma engage stateimpact oklahoma weather traffic npr national n... |
| 3 | 2 | 2459 | 2_ment_cision_entertain_overviewview | [ment, cision, entertain, overviewview, overview, general, resource, gdpr, consumer, transportation] | [launch denver delivery center hire people continue hyper growth data science solution business resource blog journalist log sign data privacy send release news product overview distribution pr newswire cision communication cloud cision ir product contact general inquiry request demo editorial bureau partnership medium inquiry worldwide office search search type field list search result appear automatically update type search content result found please change search term try news focus brow... |
| 4 | 3 | 2058 | 3_patient_healthcare_care_health | [patient, healthcare, care, health, medical, clinical, hospital, doctor, physician, medicine] | [healthcare ai advance rapidly american notice progress venturebeat skip main content event market u venturebeat homepage subscribe artificial intelligence view ai ml deep learn auto ml data label synthetic data conversational ai nlp text speech security view data security privacy network security privacy software security computer hardware security cloud data storage security data infrastructure view data science data management data storage cloud big data analytics data network automation ... |
| 5 | 4 | 2021 | 4_chatgpt_gpt_openai_chatbot | [chatgpt, gpt, openai, chatbot, text, user, write, tab, answer, chatbots] | [us chatgpt incredible way use ai power chatbot chatgpt trendingopenaixq valkyriedodo bird extinctgreen chatgptchatgpt incredible way use ai power chatbotyou heard chatgpt know use use idea get mcfadden feb pm estcreated feb pm image us engineering ipopba istock chatgpt understand generate human like chatbot use handy little internet search monthly paid professional experimental package available advanced language model developed openai cut edge natural language processing capability revolut... |
| 6 | 5 | 1765 | 5_market_analysis_player_growth | [market, analysis, player, growth, forecast, global, key, corporation, size, report] | [global ai social medium market comprehensive study impact analysis business strategy growth share size current well future challenge haitian caribbean news network skip content thursday dec break news global shower gel market analysis share circumstance covid outbreak forecast performance costume market witness huge growth beedpan pierre cardin zara worldwide image recognition market data management supply chain analytics forecasting global search advertising software market late impact ana... |
| 7 | 6 | 1272 | 6_student_teacher_classroom_education | [student, teacher, classroom, education, teach, educator, school, essay, chatgpt, cheat] | [paper exam chatbot ban college seek chatgpt proof assignment skip careslive watchweather day camslive newscastssubmit story ideasubmit photo videosfind call sport blitzfriday night endzonebroncosair forcestats predictionshow watchadvertise usstation jobskktv news carescommunity calendargood news fridayvideomeet teamcontact usmr foodlatest dailytv listingscircle country music lifestylegray dc healthpress releasespaper exam chatbot ban college seek chatgpt proof assignmentslori anne salem ass... |
| 8 | 7 | 1256 | 7_human_bias_could_humanity | [human, bias, could, humanity, ai, extinction, think, intelligence, might, risk] | [golden opportunity shape ai national priority answer vital question future ai humanity get say ai ethic ai lawsubscribe sign inbetathis beta experience may opt click heremore forbesjun edtgenerative ai cybersecurity friend foejun edtyes duty judge court forewarn lawyer potential pitfall use generative ai legal work asks ai ethic ai lawjun edtfostering international collaboration ai healthcarejun edttravel summer fun google chatgpt ai companionsjun edtis cod education know dead may edtthe ri... |
| 9 | 8 | 1078 | 8_newswires_presswire_ein_guinea | [newswires, presswire, ein, guinea, dakota, virginia, carolina, distribution, south, north] | [ai new ai crypto announces launch ein presswire different well work testimonial contact ein presswire news pricing distribution distribution overview medium database major news site tv radio station international newswires newswires industry newswires country newswires state mobile apps newsplugin live feed sample distribution report press release feature industry country state archive newswires international newswires newswires industry agriculture airline automotive banking book publishin... |
3. Word Clouds for Representation and Representative_Docs¶
In [42]:
# Flatten the list of words in each representation into a single string and then join all strings
all_representations = ' '.join([' '.join(repr_list) for repr_list in positive_topic_df['Representation']])
# Create a word cloud
wordcloud_rep = WordCloud(background_color='white').generate(all_representations)
# Plotting
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud_rep, interpolation='bilinear')
plt.axis('off')
plt.show()
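Joining all keywords into one string leaves the tokenizing and counting to `WordCloud.generate`, which by default may also pick up pairs of adjacent words. An alternative, sketched here under the assumption that each keyword should count once per topic, is to build the frequency table explicitly; `keyword_frequencies` is a hypothetical helper, and the resulting dictionary is the kind of input `WordCloud.generate_from_frequencies` accepts.

```python
from collections import Counter


def keyword_frequencies(representations):
    """Count in how many topics each keyword appears."""
    counter = Counter()
    for words in representations:
        counter.update(set(words))  # set(): count a word at most once per topic
    return dict(counter)


# Three Representation lists taken from positive_topic_df (topics 0, 4 and 7)
representations = [
    ["entrepreneur", "employee", "automation", "data", "generative",
     "worker", "job", "business", "ai", "enterprise"],
    ["chatgpt", "gpt", "openai", "chatbot", "text",
     "user", "write", "tab", "answer", "chatbots"],
    ["human", "bias", "could", "humanity", "ai",
     "extinction", "think", "intelligence", "might", "risk"],
]

freqs = keyword_frequencies(representations)
print(freqs["ai"])  # 2: "ai" appears in two of the three lists
```

The word sizes in the resulting cloud then reflect the number of topics that share a keyword, rather than whatever the text tokenizer extracts from the joined string.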
Representative_Docs (Topics 1 to 10)¶
In [43]:
# Assuming 'Representative_Docs' contains lists of strings
for topic in range(1, 11):
    doc_list = positive_topic_df[positive_topic_df['Topic'] == topic]['Representative_Docs'].iloc[0]
    if isinstance(doc_list, list):
        doc_str = ' '.join(doc_list) # Join list into a single string
    else:
        doc_str = doc_list # If it's already a string
    # Generate word cloud
    wordcloud_doc = WordCloud(background_color='white').generate(doc_str)
    # Plotting
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud_doc, interpolation='bilinear')
    plt.title(f"Word Cloud for Topic {topic}")
    plt.axis('off')
    plt.show()
3.2. Positive Sentiment and Topics Over Time¶
1. Yearly Analysis¶
1. Aggregate Topic Counts Over Time¶
In [44]:
# Count the frequency of each topic
topic_counts = news_po['bert_topics'].value_counts()
# Remove topic -1 and get the top 10 topics
top_10_topics = topic_counts.drop(-1).nlargest(10).index
In [45]:
# Filter the dataset
filtered_news_po = news_po[news_po['bert_topics'].isin(top_10_topics)]
In [46]:
# Group by year and topic, and count occurrences
topic_trends = filtered_news_po.groupby(['year', 'bert_topics']).size().reset_index(name='counts')
2. Pivot the Data for Analysis¶
In [47]:
# Pivot the data
topic_trends_pivot = topic_trends.pivot(index='year', columns='bert_topics', values='counts').fillna(0)
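The group, count and pivot steps can also be written as a single cross-tabulation. This is a sketch on toy data, not the notebook's code; the toy frame only mimics the `year` and `bert_topics` columns of `filtered_news_po`.

```python
import pandas as pd

# Toy stand-in for filtered_news_po
toy = pd.DataFrame({
    "year": [2022, 2022, 2023, 2023, 2023],
    "bert_topics": [0, 1, 0, 0, 4],
})

# Route used in the notebook: groupby + size, then pivot and fill gaps with 0
trends = toy.groupby(["year", "bert_topics"]).size().reset_index(name="counts")
pivoted = trends.pivot(index="year", columns="bert_topics", values="counts").fillna(0)

# Equivalent one-step route: crosstab counts co-occurrences directly
crossed = pd.crosstab(toy["year"], toy["bert_topics"])

print(pivoted)
print(crossed)
```

One practical difference: `pivot(...).fillna(0)` yields floating-point counts whenever a year/topic combination is missing, while `pd.crosstab` keeps integer counts.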
In [48]:
topic_trends_pivot.head()
Out[48]:
| year / bert_topics | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 2020 | 394 | 37 | 407 | 288 | 32 | 864 | 84 | 118 | 7 | 35 |
| 2021 | 395 | 102 | 566 | 304 | 18 | 837 | 54 | 169 | 75 | 20 |
| 2022 | 457 | 378 | 602 | 453 | 166 | 42 | 80 | 244 | 239 | 75 |
| 2023 | 1863 | 2568 | 884 | 1013 | 1805 | 22 | 1054 | 725 | 757 | 918 |
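The 2023 row holds far more top-10 documents than the earlier years (11,609 versus 2,266 in 2020), so raw counts mostly reflect the overall volume. One way to compare topics across years is to convert each row to shares. The sketch below rebuilds the table above by hand for illustration; in the notebook the same operation would be applied to `topic_trends_pivot`, and `shares` is a hypothetical name.

```python
import pandas as pd

# Yearly counts of the top 10 topics, copied from the pivot table above
counts = pd.DataFrame(
    {
        0: [394, 395, 457, 1863],
        1: [37, 102, 378, 2568],
        2: [407, 566, 602, 884],
        3: [288, 304, 453, 1013],
        4: [32, 18, 166, 1805],
        5: [864, 837, 42, 22],
        6: [84, 54, 80, 1054],
        7: [118, 169, 244, 725],
        8: [7, 75, 239, 757],
        9: [35, 20, 75, 918],
    },
    index=pd.Index([2020, 2021, 2022, 2023], name="year"),
)
counts.columns.name = "bert_topics"

# Divide every row by its total so each year sums to 1
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares.round(3))
```

Read this way, a topic whose count grows in 2023 may still lose share if the other topics grow faster.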
3. Plot the Trends¶
In [49]:
# Plot
plt.figure(figsize=(12, 6))
for topic in topic_trends_pivot.columns:
    plt.plot(topic_trends_pivot.index, topic_trends_pivot[topic], label=f'Topic {topic}')
plt.xlabel('Year')
plt.ylabel('Topic Counts')
plt.title('Top 10 Topic Trends Over Time')
plt.legend()
plt.show()
4. Detailed Analysis¶
In [52]:
# Example: Print representations of the top N topics
top_topics = topic_trends_pivot.sum().sort_values(ascending=False).head(10).index
for topic in top_topics:
    representation = positive_topic_df.loc[positive_topic_df['Topic'] == topic, 'Representation'].iloc[0]
    print(f"Topic {topic}: {representation}")
Topic 0: ['entrepreneur', 'employee', 'automation', 'data', 'generative', 'worker', 'job', 'business', 'ai', 'enterprise']
Topic 1: ['npr', 'radio', 'donate', 'schedule', 'listen', 'classical', 'donation', 'air', 'wunc', 'membership']
Topic 2: ['ment', 'cision', 'entertain', 'overviewview', 'overview', 'general', 'resource', 'gdpr', 'consumer', 'transportation']
Topic 3: ['patient', 'healthcare', 'care', 'health', 'medical', 'clinical', 'hospital', 'doctor', 'physician', 'medicine']
Topic 4: ['chatgpt', 'gpt', 'openai', 'chatbot', 'text', 'user', 'write', 'tab', 'answer', 'chatbots']
Topic 5: ['market', 'analysis', 'player', 'growth', 'forecast', 'global', 'key', 'corporation', 'size', 'report']
Topic 6: ['student', 'teacher', 'classroom', 'education', 'teach', 'educator', 'school', 'essay', 'chatgpt', 'cheat']
Topic 7: ['human', 'bias', 'could', 'humanity', 'ai', 'extinction', 'think', 'intelligence', 'might', 'risk']
Topic 8: ['newswires', 'presswire', 'ein', 'guinea', 'dakota', 'virginia', 'carolina', 'distribution', 'south', 'north']
Topic 9: ['ago', 'hour', 'bestreviews', 'nexstar', 'story', 'weather', 'newsfeed', 'file', 'video', 'ap']